GLIMPSEINDEX
Section: Misc. Reference Manual Pages (l)
Updated: October 11, 1995
NAME
glimpseindex 3.0 - index whole file systems to be searched by glimpse
OVERVIEW
Glimpse (which stands for GLobal IMPlicit SEarch)
is an indexing and query system that allows you to search through
all your files very quickly.
Glimpseindex is the indexing program for glimpse.
Glimpse supports most of agrep's options
(agrep is our powerful version of grep)
including approximate matching (e.g., finding misspelled words),
Boolean queries, and even some limited forms of regular expressions.
It is used in the same way, except that you don't have to
specify file names.
So, if you are looking for a needle
anywhere in your file system, all you have to do is say
glimpse needle
and all lines containing needle will appear preceded
by the file name.
See man glimpse for details on how to use glimpse.
Glimpseindex provides three indexing options: a tiny index (2-3% of
the total size of all files), a small index (7-9%) and a medium-size
index (20-30%). Search times are normally better with larger indexes.
To index all your files, you say
glimpseindex ~
for tiny index (where ~ stands for the home directory),
glimpseindex -o ~
for small index, and
glimpseindex -b ~
for medium.
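As a rough illustration of what these percentages mean in practice, the following sketch estimates the index size for a given amount of text. The function name estimate_index_mb is hypothetical, and the ranges are only the approximate figures quoted in this page (7-9% for the small index, as given under -o), not measurements.

```shell
# estimate_index_mb TOTAL_MB [KIND]
# Prints a rough "low-high MB" index size estimate for TOTAL_MB megabytes
# of text. KIND is tiny (default), small (-o) or medium (-b).
# The percentages are the approximate ranges quoted in this manual page.
estimate_index_mb() {
    total=$1
    case ${2:-tiny} in
        tiny)   low=2;  high=3  ;;
        small)  low=7;  high=9  ;;
        medium) low=20; high=30 ;;
        *) echo "unknown index kind: $2" >&2; return 1 ;;
    esac
    # Integer arithmetic; results are rounded down to whole megabytes.
    echo "$((total * low / 100))-$((total * high / 100)) MB"
}

estimate_index_mb 500 tiny
estimate_index_mb 500 small
estimate_index_mb 500 medium
```

Actual sizes depend on the files being indexed, so treat the result as an order-of-magnitude guide only.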
Mail glimpse-request@cs.arizona.edu to be added to the glimpse mailing list.
Mail glimpse@cs.arizona.edu to report bugs, ask questions, discuss tricks
for using glimpse, etc. (this is a moderated mailing list with very little
traffic, mostly announcements).
An HTML version of these manual pages can be found at
http://glimpse.cs.arizona.edu:1994/glimpseindexhelp.html
Also, see the glimpse developers' home page at
http://glimpse.cs.arizona.edu:1994/
SYNOPSIS
glimpseindex [ -abBEfFiInosz -w number -dD filename(s) -H directory -M number -S number ] directory_name[s]
INTRODUCTION
Glimpseindex
builds an index of all text files in all
the directories specified and all their subdirectories (recursively).
It is also possible to build several separate indexes (possibly
even overlapping).
The simplest way to index your files is to say
glimpseindex ~
The index consists of several files (described in detail below),
all with the prefix .glimpse_ stored in the user's home directory
(unless otherwise specified with the -H option).
Files with one of the following suffixes
are not indexed unless the -z option is used (see below):
".o", ".gz", ".Z", ".z", ".hqx", ".zip", ".tar".
In addition, glimpseindex attempts to determine whether a file
is a text file and does not index files that it thinks are not text files.
Numbers are not indexed unless the -n option is used.
It is possible to prevent specified files from being
indexed by adding their names to the .glimpse_exclude file (described below).
The -o option builds a larger index (typically by a factor of 2-3),
allowing for a faster search (1-5 times faster).
The -b option builds an even larger index and allows an even faster search.
There is an incremental indexing option -f, which updates an
existing index by determining which files
have been created or modified since the index was built and
adding them to the index (see -f).
Glimpseindex is reasonably fast, taking about 20 minutes to index
100MB from scratch (on a SUN Sparc 5) and 2-4 minutes
to update an existing index. (Your mileage may vary.)
It is also possible to add specific files to an existing index
(the -a option).
Once an index is built, searching for a pattern is as easy as saying
glimpse pattern
(See man glimpse for all glimpse's options and features.)
A DETAILED DESCRIPTION OF GLIMPSEINDEX
Glimpse does not index files automatically; you have to run glimpseindex
yourself. This can be done manually, but a better way is to set it to run
every night. It is probably a good idea to run glimpseindex manually
the first time to be sure it works properly.
The following is a simple script to run glimpseindex every night.
We assume that this script is stored in a file called glimpse.script:
glimpseindex -w 1000 ~ >& .glimpse_out
at -m 0300 glimpse.script
(It might be interesting to collect the output of all glimpseindex runs by
changing >& to >>& so that the file .glimpse_out maintains a history.
In this case the file must be created before the first time >>& is used.
The >& syntax is for csh; if you use sh or ksh, write '> .glimpse_out 2>&1' instead.)
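For readers who use sh rather than csh, the nightly run could be sketched as follows. The function name nightly_index and the date-stamp line are illustrative additions, not part of glimpse; only the glimpseindex -w 1000 invocation comes from the script above. In sh, >> creates the log file if it does not yet exist.

```shell
# nightly_index DIR [LOG]
# Runs glimpseindex on DIR and appends its output, preceded by a date
# stamp, to LOG (default: $HOME/.glimpse_out).
nightly_index() {
    dir=$1
    log=${2:-$HOME/.glimpse_out}
    {
        echo "== glimpseindex run: $(date) =="
        glimpseindex -w 1000 "$dir"
    } >> "$log" 2>&1
}

# Example (commented out so that loading this file has no side effects):
# nightly_index "$HOME"
```

Because the output is appended, .glimpse_out keeps a history of all runs.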
Glimpseindex stores the names of all the files that it indexed
in the file .glimpse_filenames.
Each file is listed by its full path name as obtained at the time
the files were indexed (for example, /usr1/udi/file1).
Glimpse uses this full name when it performs the search, so the name
must match the current name.
This may become a problem when the indexing and the search
are done from different machines (e.g., through NFS), where the same
file may have a different path name
(for example, /tmp_mnt/R/xxx/xxx/usr1/udi/file1).
(The same is true for several other .glimpse files. See below.)
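One way to notice such a path mismatch is to check how many of the recorded names exist on the machine where the search is run. The sketch below assumes that .glimpse_filenames holds one full path name per line, as described in this page; the function name count_missing_paths is hypothetical, and lines that do not start with "/" are skipped as a precaution.

```shell
# count_missing_paths FILE
# Reads path names from FILE, one per line, and prints how many of them
# do not exist on this machine. Lines not starting with "/" are ignored.
count_missing_paths() {
    missing=0
    while IFS= read -r path; do
        case $path in
            /*) [ -e "$path" ] || missing=$((missing + 1)) ;;
        esac
    done < "$1"
    echo "$missing"
}

# Example (commented out; requires an existing index):
# count_missing_paths "$HOME/.glimpse_filenames"
```

A large count on the searching machine suggests that the index was built under different mount points.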
Glimpseindex does not follow symbolic links unless they are
explicitly included in the .glimpse_include file (described below).
Glimpseindex makes an effort to identify non-text files such as
binary files, compressed files, uuencoded files, postscript files,
binhex files, etc.
Such files are skipped automatically.
In addition, all files whose names end with `.o', `.gz', `.Z', `.z',
`.hqx', `.zip', or `.tar'
will not be indexed (unless they are specifically included
in .glimpse_include - see below).
The options for glimpseindex are as follows:
-a
adds the given file[s] and/or directories to an existing index.
Any given directory will be traversed recursively and all files will
be indexed (unless they appear in .glimpse_exclude; see below).
Using this option is generally much faster than indexing everything
from scratch, although in rare cases the index may not be as good.
If for some reason the index is full (which can happen
unless -o or -b are used)
glimpseindex -a will produce
an error message and will exit without changing the original index.
-b
builds a medium-size index (20-30% of the size of all files),
allowing faster search. This option forces glimpseindex to store
an exact (byte level) pointer to each occurrence of each word
(except for some very common words
belonging to the stop list).
-B
uses a hash table that is 4 times bigger (256k entries instead of 64K)
to speed up indexing.
The memory usage will increase typically by about 2 MB.
This option is only for indexing speed; it does not affect the final
index.
-d filename(s)
deletes the given file(s) from the index.
-D filename(s)
deletes the given file(s) from the list of file names, but not
from the index. This is much faster than -d, and the file(s) will
not be found by glimpse. However, the index itself will not become
smaller.
-E
does not run a check on file types. Glimpseindex normally attempts to
exclude non-text files, but this attempt is not always perfect.
With -E, glimpseindex indexes
all files, except those that are specifically excluded in .glimpse_exclude
and those whose file names end with one of the excluded suffixes.
-f
incremental indexing. glimpseindex scans all files
and adds to the index only those files that were created or modified
after the current index was built.
If there is no current index or if this procedure fails, glimpseindex
automatically reverts to the default mode
(which is to index everything from scratch).
This option may create an inefficient index for several reasons,
one of which is that deleted files are not really deleted from the index.
Unless changes are small, mostly additions, and -o is used,
we suggest using the default mode as much as possible.
-F
Glimpseindex receives the list of files to index from standard input.
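The -F option combines naturally with -I (described below), which prints the list of files that would be indexed. As a minimal sketch, the function below filters that list with grep before handing it back to glimpseindex. The name index_without is hypothetical, and it is assumed that -I prints one file name per line and nothing else.

```shell
# index_without PATTERN DIR
# Lists the files glimpseindex would index under DIR, drops every name
# that matches the grep pattern PATTERN, and indexes the remainder.
index_without() {
    glimpseindex -I "$2" | grep -v -e "$1" | glimpseindex -F
}

# Example (commented out; requires glimpseindex to be installed):
# index_without '\.log$' "$HOME"
```

For exclusions that should apply to every run, the .glimpse_exclude file is the more permanent mechanism.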
-H directory
Put or update the index and all other .glimpse files (listed below)
in "directory".
The default is the home directory.
When glimpse is run, the -H option must be used to direct glimpse to this
directory, because glimpse assumes that the index is in the home
directory (see also the -H option in glimpse).
-i
Make .glimpse_include (see GLIMPSEINDEX FILES) take precedence
over .glimpse_exclude,
so that, for example, one can exclude everything (by putting *
in .glimpse_exclude) and then explicitly include files.
-I
Instead of indexing, only show (print to standard out)
the list of files that would be indexed.
It is useful for filtering purposes.
("glimpseindex -I dir | glimpseindex -F" is the same as
"glimpseindex dir".)
-M x
Tells glimpseindex to use x MB of memory for temporary tables.
The more memory you allow, the faster glimpseindex will run.
The default is x=2.
The value of x must be a positive integer.
Glimpseindex will need more memory than x for other things, and
glimpseindex may perform some 'forks', so you'll
have to experiment if you want to use this option.
WARNING:
If x is too large you may run out of swap space.
-n
Index numbers as well as text. The default is not to index numbers.
This is useful when searching for dates or other identifying numbers,
but it may make the index very large if there are lots of numbers.
In general, glimpseindex strips away any non-alphabetic character.
For example, the string abc123 will be indexed as abc if the -n option
is not used and as abc123 if it is used.
Glimpseindex writes warnings (to .glimpse_messages) for all files
in which more than half the words that were added to the index
from that file had digits in them (this is an attempt to identify
data files that should probably not be indexed).
One can use the .glimpse_exclude file to exclude data files or any
other files.
(See GLIMPSEINDEX FILES.)
-o
Build a small index rather than tiny
(meaning 7-9% of the sizes of all files - your mileage may vary)
allowing faster search. This option forces glimpseindex to allocate
one block per file (a block usually contains many files).
A detailed explanation of how blocks affect glimpse can be
found in the glimpse article.
(See also LIMITATIONS.)
-s
supports structured queries. This option was added to support the
Harvest project and it is applicable mostly in that context.
See STRUCTURED QUERIES below for more information and also
http://harvest.cs.colorado.edu for more information
about the Harvest project.
-S k
The number k determines the size of the stop-list.
The stop-list consists of words that are too common and are not indexed
(e.g., 'the' or 'and').
Instead of having a fixed stop-list, glimpseindex figures out the
words that are too common for every index separately.
The rules are different for the different indexing options.
The tiny index contains all words (the savings from a stop-list are
too small to bother).
For the small index (-o), the number k is a percentage threshold:
a word will be in the stop-list if it appears in at least k% of all files.
The default value is 80%.
(If there are fewer than 256 files, then the stop-list is not maintained.)
The medium index (-b) counts all occurrences of all words, and a word
is added to the stop-list if it appears at least k times per MByte.
The default value is 500.
A query that includes a stop list word is of course less efficient.
(See also LIMITATIONS below.)
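To make the two thresholds concrete, the sketch below turns them into absolute numbers. Both function names are hypothetical, and the reading of "k times per MByte" as k multiplied by the total size in megabytes is an assumption about the rule described above, not something taken from the glimpseindex source.

```shell
# small_stop_threshold NFILES [K]
# With -o, a word becomes a stop-list word once it appears in at least K%
# of the files (default K=80). Prints the smallest such file count, or
# "none" when there are fewer than 256 files (no stop-list is kept).
small_stop_threshold() {
    nfiles=$1
    k=${2:-80}
    if [ "$nfiles" -lt 256 ]; then
        echo none
        return 0
    fi
    # Ceiling of nfiles * k / 100 using integer arithmetic.
    echo $(( (nfiles * k + 99) / 100 ))
}

# medium_stop_threshold TOTAL_MB [K]
# With -b, a word becomes a stop-list word once it occurs at least K times
# per megabyte (default K=500). Prints the corresponding occurrence count.
medium_stop_threshold() {
    echo $(( $1 * ${2:-500} ))
}

small_stop_threshold 1000      # prints 800
small_stop_threshold 200       # prints none
medium_stop_threshold 100      # prints 50000
```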
-w k
Glimpseindex does a reasonable, but not a perfect, job of determining
which files should not be indexed.
Sometimes a large text file should not be indexed; for
example, a dictionary may match most queries.
The -w option stores in a file called .glimpse_messages (in the same
directory as the index) the list of all files that contribute
at least k new words to the index. The user can look at this list
of files and decide which should or should not be indexed.
The file .glimpse_exclude lists files that will not be indexed
(see more below). We recommend setting k to about 1000.
This is not an exact measure. For example, if the same file appears
twice, then the second copy will not contribute any new words
to the dictionary (but if you exclude the first copy and index again,
the second copy will contribute).
-z
Allow customizable filtering, using the file .glimpse_filters
to run the programs listed there on each matching file. The best example is
compress/decompress. If .glimpse_filters includes the line
*.Z uncompress <
(separated by tabs)
then before indexing any file that matches the pattern "*.Z" (same
syntax as the one for .glimpse_exclude) the command listed is
executed first (assuming input is from stdin, which is why uncompress
needs <) and its output (assuming it goes to stdout) is indexed.
The file itself is not changed (i.e., it stays compressed).
Then if glimpse -z is used, the same program is used on these files
on the fly. Any program can be used (we run 'exec'). For example,
one can filter out parts of files that should not be indexed.
Glimpseindex tries to apply all filters in .glimpse_filters in the
order they are given.
For example, if you want to uncompress a file and then extract
some part of it, put the uncompress command (the example above)
first and then another line that specifies the extraction.
Note that this can slow down the search because the filters need to
be run before files are searched.
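Because the pattern and the command in .glimpse_filters must be separated by a tab, it can be safer to generate the line than to type it in an editor that may convert tabs to spaces. A minimal sketch follows; the helper name add_filter is hypothetical, and only the "*.Z" line from the description of -z is taken from this page.

```shell
# add_filter FILE PATTERN COMMAND
# Appends one "PATTERN<TAB>COMMAND" line to the filters file FILE.
add_filter() {
    printf '%s\t%s\n' "$2" "$3" >> "$1"
}

# Example (commented out; adjust the directory to where the index lives):
# add_filter "$HOME/.glimpse_filters" '*.Z' 'uncompress <'
```

Since filters are applied in the order given, the order of the add_filter calls matters.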
GLIMPSEINDEX FILES
All files used by glimpse are located in the directory (or directories) where
the index (or indexes) is stored, and have .glimpse_ as a prefix.
The files .glimpse_exclude, .glimpse_filters and .glimpse_include are
optionally supplied by the user. The other files are built and
read by glimpse.
.glimpse_exclude
contains a list of files that glimpseindex is explicitly told to ignore.
In general, the syntax of .glimpse_exclude/include is the same as
that of agrep (or any other grep). The lines in the .glimpse_exclude
file are matched to the file names, and if they match, the files
are excluded. Notice that agrep matches parts of the string:
e.g., the pattern /ftp/pub will match /home/ftp/pub and /ftp/pub/whatever.
So, if you want to exclude /ftp/pub/core, you just list
it, as is, in the .glimpse_exclude file.
If you put "/home/ftp/pub/cdrom" in .glimpse_exclude, every file
name that matches that string will be excluded, meaning all files
below it.
You can use ^ to indicate the beginning of a file name, and $ to
indicate the end of one, and you can use * and ? in the usual way.
For example /ftp/*html will exclude /ftp/pub/foo.html, but will
also exclude /home/ftp/pub/html/whatever; if you want to exclude
files that start with /ftp and end with html use ^/ftp*html$
Notice that putting a * at the beginning or at the end is redundant
(in fact, in this case glimpseindex will remove the * when it
does the indexing).
No other meta characters are allowed in .glimpse_exclude
(e.g., don't use .* or # or |).
Lines with * or ? must have no more than 30 characters.
Notice that, although the index itself will not be indexed,
the list of file names (.glimpse_filenames) will be indexed
unless it is explicitly listed in .glimpse_exclude.
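The matching rules above can be tried out before re-indexing with a small approximation built on grep. This is only a sketch: it uses grep rather than agrep's own matcher, handles just the characters ^, $, * and ? that are allowed in .glimpse_exclude, and the function names exclude_to_regex and is_excluded are hypothetical.

```shell
# exclude_to_regex PATTERN
# Converts a .glimpse_exclude style pattern to a basic regular expression:
# "." is made literal, "*" becomes ".*" and "?" becomes ".".
# A leading ^ and a trailing $ already mean the same thing to grep.
exclude_to_regex() {
    printf '%s\n' "$1" | sed -e 's/\./\\./g' -e 's/\*/.*/g' -e 's/?/./g'
}

# is_excluded PATTERN PATH
# Succeeds when PATTERN matches any part of PATH.
is_excluded() {
    printf '%s\n' "$2" | grep -q -e "$(exclude_to_regex "$1")"
}

is_excluded '/ftp/*html' '/home/ftp/pub/html/whatever' && echo "excluded"
is_excluded '^/ftp*html$' '/home/ftp/pub/html/whatever' || echo "kept"
```

The two calls mirror the example above: the unanchored pattern also catches the file under /home/ftp, while the anchored one does not.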
.glimpse_filters
See the description above for the -z option.
.glimpse_include
contains a list of files that glimpseindex
is explicitly told to include in the index even though they may look
like non-text files. Symbolic links are followed by glimpseindex
only if they are specifically included here.
The syntax is the same as the one for .glimpse_exclude (see there).
If a file is in both .glimpse_exclude and .glimpse_include it will be
excluded unless -i is used.
.glimpse_filenames
contains the list of all indexed file names, one per line.
This is an ASCII file that can also be searched directly (with glimpse or
agrep) for a file name, which gives a fast find command.
For example,
glimpse 'count#\.c$' ~/.glimpse_filenames
will output the names of all (indexed) .c files that have 'count' in
their name (the match may be anywhere in the path name).
Setting the following alias in the .login file may be useful (csh syntax):
alias findfile 'glimpse -h \!:1 ~/.glimpse_filenames'
.glimpse_index
contains the index. The index consists of lines, each starting with a
word followed by a list of block numbers (unless the -o or -b options
are used, in which case each word is followed by an offset into
the file .glimpse_partitions where all pointers are kept).
The block/file numbers are stored in binary form, so this is not an ASCII file.
.glimpse_messages
contains the output of the -w option (see above).
.glimpse_partitions
contains the partition of the indexed space into blocks
and, when the index is built with the -o or -b options, some part of the
index. This file is used internally by glimpse and it is
a non-ASCII file.
.glimpse_statistics
contains some statistics about the makeup of the index. Useful for
some advanced applications and customization of glimpse.
STRUCTURED QUERIES
Glimpse can search for Boolean combinations of "attribute=value" terms
by using the Harvest SOIF parser library (in glimpse/libtemplate).
To search this way, the index must be made by using the -s option of
glimpseindex (this can be used in conjunction with other glimpseindex
options). For glimpse and glimpseindex to recognize "structured" files,
they must be in SOIF format. In this format, each value is prefixed by
an attribute-name with the size of the value (in bytes) present in "{}"
after the name of the attribute.
For example, the following lines are part of an SOIF file:
type{17}: Directory-Listing
md5{32}: 3858c73d68616df0ed58a44d306b12ba
Any string can serve as an attribute name.
The query glimpse "pattern;type=Directory-Listing" will search for "pattern"
only in files whose type is "Directory-Listing".
The file itself is considered to be
one "object" and its name/url appears as the first attribute with an
"@" prefix; e.g.,
@FILE { http://xxx... }
The scope of Boolean operations changes from records
(lines) to whole files when structured queries are used in glimpse
(since individual query terms can look at different attributes and they
may not be "covered" by the record/line). Note that glimpse can only
search for patterns in the value parts of the SOIF file: there are some
attributes (like the TTL, MD5, etc.) that are interpreted by Harvest's
internal routines.
See http://harvest.cs.colorado.edu/harvest/user-manual/ for more detailed
information on the SOIF format.
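Since each SOIF value carries its size in bytes, a hand-written attribute line is easy to get wrong. The following sketch checks single-line attributes of the form shown in the example above. The function name soif_line_ok is hypothetical, values that span several lines are not handled, and one separator character (a space or a tab) after the colon is assumed.

```shell
# soif_line_ok LINE
# Succeeds when LINE has the form "name{N}: value" and value is exactly
# N bytes long.
soif_line_ok() {
    line=$1
    case $line in
        *'{'*'}:'*) ;;
        *) return 1 ;;
    esac
    size=${line#*\{}        # drop everything up to the first "{"
    size=${size%%\}*}       # keep what precedes the first "}"
    case $size in
        ''|*[!0-9]*) return 1 ;;   # the size must be a plain number
    esac
    value=${line#*\}:}      # text after "}:"
    tab=$(printf '\t')
    case $value in
        ' '*|"$tab"*) value=${value#?} ;;   # drop the one separator
    esac
    # Count bytes with wc; the arithmetic strips any padding wc adds.
    len=$(( $(printf '%s' "$value" | wc -c) ))
    [ "$len" -eq "$size" ]
}

soif_line_ok 'type{17}: Directory-Listing' && echo "size matches"
```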
HOW TO DETERMINE THE INDEX TYPE
If you want to determine the type of an existing index,
check the first 3 lines of the file ".glimpse_index"
(which can be obtained by
running "head -3 .glimpse_index").
These lines always begin with "%".
If the first line has the string "1234567890" after the "%", it means that
numbers were indexed (glimpseindex -n);
otherwise, it means that numbers were not indexed.
If the second line has a 0 after the "%", then a tiny (default) index was
created by glimpseindex;
if there is a negative integer after the "%", then a medium
sized index was created (glimpseindex -b);
if there is a positive integer
after the "%", then a small index was created (glimpseindex -o).
In the latter two cases,
the absolute value of the integer tells you the number of
files that were indexed. On the third line, if the "-s" option of
glimpseindex was used to build an index for structured queries,
the positive integer
after the "%" tells you the number of attributes that were found; if not,
the third line just contains a "%0".
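The checks described in this section can be collected into a small helper. This is a sketch under the assumption that the three header lines look exactly as described here ("%" followed immediately by the value); the function name index_type and the wording of its output are illustrative only.

```shell
# index_type FILE
# Reads the first three lines of FILE (a .glimpse_index file) and prints
# a one-line summary of how the index was built.
index_type() {
    l1=$(head -n 3 "$1" | sed -n '1p')
    l2=$(head -n 3 "$1" | sed -n '2p')
    l3=$(head -n 3 "$1" | sed -n '3p')

    case $l1 in
        %1234567890*) numbers="numbers indexed" ;;
        *)            numbers="numbers not indexed" ;;
    esac

    n=${l2#%}
    case $n in
        0)  size="tiny" ;;
        -*) size="medium (-b), ${n#-} files" ;;
        *)  size="small (-o), $n files" ;;
    esac

    attrs=${l3#%}
    case $attrs in
        0|'') structured="not structured" ;;
        *)    structured="structured (-s), $attrs attributes" ;;
    esac

    echo "$size; $numbers; $structured"
}

# Example (commented out; requires an existing index):
# index_type "$HOME/.glimpse_index"
```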
REFERENCES
1.
U. Manber and S. Wu,
"GLIMPSE: A Tool to Search Through Entire File Systems,"
Usenix Winter 1994 Technical Conference,
San Francisco (January 1994), pp. 23-32.
Also, Technical Report #TR 93-34, Dept. of Computer Science,
University of Arizona, October 1993 (a postscript file
is available by anonymous ftp at
cs.arizona.edu:reports/1993/TR93-34.ps).
2.
S. Wu and U. Manber,
"Fast Text Searching Allowing Errors,"
Communications of the ACM
35 (October 1992), pp. 83-91.
SEE ALSO
agrep(1),
ed(1),
ex(1),
glimpse(1),
glimpseserver(1),
grep(1V),
sh(1),
csh(1).
LIMITATIONS
The index of glimpse is word based. A pattern that contains more than
one word cannot be found in the index. The way glimpse overcomes this
weakness is by splitting any multi-word pattern into its set of words
and looking for all of them in the index.
For example, glimpse 'linear programming' will first consult the index
to find all files containing both linear and programming,
and then apply agrep to find the combined pattern.
This is usually an effective solution, but it can be slow for
cases where both words are very common, but their combination is not.
The index of glimpse stores all patterns in lower case.
When glimpse searches the index it first converts
all patterns to lower case, finds the appropriate files,
and then searches the actual files using the original
patterns.
So, for example, glimpse ABCXYZ will first find all
files containing abcxyz in any combination of lower and upper
case, and then search these files directly, so only matches
with the right case will be found.
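The two preprocessing steps described above (splitting a multi-word pattern and converting it to lower case) can be pictured with the following sketch. It only illustrates the idea for plain words separated by spaces; it is not glimpse's actual code, and the function name index_terms is hypothetical.

```shell
# index_terms PATTERN
# Prints the lower-case words that would be looked up in the index for a
# simple pattern consisting of words separated by spaces, one per line.
index_terms() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -s ' ' '\n'
}

index_terms 'Linear Programming'
```

The files found for these terms are then searched directly with the original pattern, which is where the case and word order are checked.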
One problem with this approach is discovering misspellings
that are caused by wrong cases.
For example, glimpse -B abcXYZ will first search the
index for the best match to abcxyz (because the pattern is
converted to lower case); it will find that there are matches
with no errors, and will go to those files to search them
directly, this time with the original upper cases.
If the closest match is, say AbcXYZ, glimpse may miss it,
because it doesn't expect an error.
Another problem is speed. If you search for "ATT", it will look
at the index for "att". Unless you use -w to match the whole word,
glimpse may have to search all files containing, for example, "Seattle"
which has "att" in it.
There is no size limit for simple patterns and simple patterns
with Boolean AND.
More complicated patterns
are currently limited to approximately 30 characters.
Lines are limited to 1024 characters.
Records are limited to 48K, and may be truncated if they are larger
than that.
The limit of record length can be
changed by modifying the parameter Max_record in agrep.h.
Each line in .glimpse_exclude or .glimpse_include that contains
a * or a ? must not exceed 30 characters in length.
Glimpseindex does not index words longer than 64 characters.
A medium-size index (-b) may actually lead to slower query times
if the files are all very small.
Under -b, it may be impossible to make the stop list empty.
Glimpseindex uses the "sort" routine, and all occurrences
of a word appear at some point on one line.
Sort limits the size of the lines it can handle (the value depends
on the platform; ours is 16KB).
If a line is too big, the word is added to the stop list.
BUGS
Please send bug reports or comments to glimpse@cs.arizona.edu.
AUTHORS
Udi Manber and Burra Gopal, Department of Computer Science,
University of Arizona, and Sun Wu, the National Chung-Cheng University,
Taiwan. (Email: glimpse@cs.arizona.edu)